42 research outputs found

    Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons

    In just the last decade, a multitude of biotechnologies and software pipelines have emerged to revolutionize genomics. Their central goal is to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental "features" of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler or assembly. International competitions such as the Assemblathons and GAGE tried to identify the best assembler(s) and their features. Somewhat circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality, fully assembled reference genome sequence. Worse still, reference-guided evaluations are often difficult to analyze and lead to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed FRCbam, which is capable of evaluating de novo assemblies from the read layouts even when no reference exists. We extend the FRCurve approach to cases where layout information may have been obscured, as is true in many de Bruijn graph-based algorithms. As a by-product, FRCurve now applies to a much wider class of assemblers, identifying higher-quality members of this group, their inter-relations, and their sensitivity to carefully selected features, with or without the support of a reference sequence or a layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.
    Comment: Submitted to PLoS One. Supplementary material available at http://www.nada.kth.se/~vezzi/publications/supplementary.pdf and http://cs.nyu.edu/mishra/PUBLICATIONS/12.supplementaryFRC.pd

    Feature-by-Feature – Evaluating De Novo Sequence Assembly

    The whole-genome sequence assembly (WGSA) problem is one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation. On the one hand, metrics like N50 and number of contigs focus only on size, without proportionately emphasizing the correctness of the assembly. On the other hand, comparisons performed on simulated datasets can be highly biased by the unrealistic assumptions in the underlying read generator. Recently, the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques such as principal and independent component analysis, we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describe the assemblers' performance. We demonstrated that, by focusing on a reduced set of highly informative features, we can use the FRC to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to unrealistic results.
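To make concrete why a size-only metric such as N50 says nothing about correctness, here is a minimal Python sketch of the usual N50 definition. The contig lengths are invented for illustration and do not come from the paper.

```python
def n50(contig_lengths):
    """Return the N50: the length L of the contig at which, walking from the
    longest contig downwards, the running total first reaches at least half
    of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical assemblies of the same 100 bases.
fragmented = [40, 30, 20, 10]  # several shorter contigs
merged = [100]                 # one contig, possibly a misjoin

n50_fragmented = n50(fragmented)
n50_merged = n50(merged)
```

The merged assembly scores higher even if the single contig was produced by an incorrect join, because N50 only looks at lengths.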

    Comparing De Novo Genome Assembly: The Long and Short of It

    Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers, both for low-coverage long (Sanger) and high-coverage short (Illumina) read technologies, are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read lengths, coverages, and accuracies, with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing “next-generation” assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
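The trade-off that the FRC captures can be sketched in a few lines of Python. This is an illustrative reading, not the AMOS tool: it assumes each contig comes with a count of suspicious "features", that contigs are tallied from longest to shortest, and that tallying stops at the first contig that would push the feature total above the threshold. The function name and the toy data are hypothetical.

```python
def feature_response_curve(contigs, genome_size, thresholds):
    """contigs: list of (length, feature_count) pairs.
    For each feature threshold phi, return (phi, approximate coverage),
    where coverage is the summed length of the longest contigs whose
    cumulative feature count stays <= phi, divided by genome_size."""
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    curve = []
    for phi in thresholds:
        features = 0
        covered = 0
        for length, feature_count in ordered:
            if features + feature_count > phi:
                break  # assumption: stop at the first contig that exceeds phi
            features += feature_count
            covered += length
        curve.append((phi, covered / genome_size))
    return curve


# Toy assembly: (contig length, number of features flagged on that contig).
toy_contigs = [(500, 3), (300, 0), (200, 1)]
curve = feature_response_curve(toy_contigs, genome_size=1000, thresholds=[0, 3, 4])
```

Under this reading, an assembly whose curve rises faster reaches a given coverage while tolerating fewer features.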

    Unexpected frequency of the pathogenic AR CAG repeat expansion in the general population

    CAG repeat expansions in exon 1 of the AR gene on the X chromosome cause spinal and bulbar muscular atrophy, a male-specific progressive neuromuscular disorder associated with a variety of extra-neurological symptoms. The disease has a reported male prevalence of 1:30,303 or less, but the AR repeat expansion frequency is unknown. We established a pipeline, which combines the ExpansionHunter tool with visual validation, to detect AR CAG expansions in whole-genome sequencing data, benchmarked it against fragment PCR sizing, and applied it to 74,277 unrelated individuals from four large cohorts. Our pipeline showed a sensitivity of 100% (95% C.I. 90.8-100%), a specificity of 99% (95% C.I. 94.2-99.7%), and a positive predictive value of 97.4% (95% C.I. 84.4-99.6%). We found the mutation frequency to be 1:3,182 X chromosomes (95% C.I. 1:2,309-1:4,386, n=117,734), ten times more frequent than the reported disease prevalence. Modelling using the novel mutation frequency yielded an estimated disease prevalence of 1:6,887 males, more than four times more frequent than the reported disease prevalence. This discrepancy is possibly due to underdiagnosis of this neuromuscular condition, reduced penetrance, and/or pleomorphic clinical manifestations.
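The reported frequency and its interval can be sanity-checked with a short Python sketch. Two things here are assumptions rather than statements from the abstract: the carrier count is back-calculated from 1:3,182 and n=117,734, and the Wilson score interval is used as one common choice for a binomial proportion (the abstract does not say which method the authors used).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


chromosomes = 117_734                  # n, from the abstract
carriers = round(chromosomes / 3182)   # assumption: inferred from the 1:3,182 frequency
low, high = wilson_interval(carriers, chromosomes)

# Express the bounds as "1 in N" X chromosomes, as the abstract does.
one_in_upper = 1 / low    # rarer end of the interval
one_in_lower = 1 / high   # more frequent end of the interval
```

With these inputs the bounds land close to the reported 1:2,309 and 1:4,386, which suggests, but does not prove, that a similar interval was used.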

    A crowdsourced set of curated structural variants for the human genome.

    Funder: U.S. Food and Drug Administration; funder-id: http://dx.doi.org/10.13039/100000038
    A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is more challenging to develop. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high-confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.

    Stracquadanio G: Robust Bio-active Peptide Prediction Using Multi-objective Optimization

    Bio-active peptides control many important functions in organisms, such as cell reproduction, appetite, euphoria, sleep, learning, immune response, etc. They also act on hormones, neurotransmitters, antioxidants, toxins and antibiotics. Because of their importance, bioactive peptides have received particular attention, and extensive studies have been carried out to determine their structures. Although their typical size does not exceed 30 amino acids, their 3D structure is challenging to predict for the following reasons: (i) their conformation often includes β-turns, which are more difficult to predict using standard potential energy functions; (ii) they fold into structures that are not similar to already known proteins, which makes them hard instances for comparative modeling techniques; (iii) they are more exposed to the solvent than longer proteins, and this additional effect influences their final conformation. This paper presents a strategy for peptide structure prediction that uses: (1) a multi-objective formulation of the optimization problem, (2) a multi-objective evolutionary algorithm to explore the search space, (3) a decision-making phase based on different metrics to select a solution from the Pareto front, and (4) a method to analyze the robustness of the solution using the Monte Carlo method. We have tested this prediction pipeline on a large dataset of 43 bioactive peptides, and the experimental results show that this method outperforms the PEPstr prediction server and is competitive against a more recent Generalized Pattern Search approach. Unlike standard single-objective methods, it can generate multiple solutions, which are generally more robust than the wild-type.
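The notion of a Pareto front used in step (3) can be illustrated with a small Python sketch. It assumes two objectives that are both minimized; the objective values are made up, and the abstract does not specify which energy terms or selection metrics the authors actually use.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidate conformations scored on two objectives.
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_front(candidates)
```

Here (3.0, 4.0) is dropped because (2.0, 3.0) is better on both objectives. The remaining points are mutually incomparable, which is why a separate decision-making phase is needed to pick one of them.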

    MODELING AND SIMULATION OF E-MAIL SOCIAL NETWORKS: A NEW STOCHASTIC AGENT-BASED APPROACH

    Understanding how the structure of a network evolves over time is one of the most interesting and complex topics in the field of social networks. In our attempt to model the dynamics of such systems, we explore an agent-based approach to modeling the growth of email-based social networks, in which individuals establish and maintain links, or let them atrophy, through contact lists and emails. The model is based on the idea of common neighbors, but also on a detailed specialization of the classical preferential attachment theory, thus capturing a deeper understanding of the topology of inter-node connections. In our event-based simulation, which schedules the agents' actions over time, the proposed model is amenable to significant efficiency improvements through an application of the Gillespie stochastic simulation schemes. Computer simulation results are used to validate the model by showing that its unique features endow it with the ability to simulate real-world email networks with convincing realism.
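As a rough illustration of how a Gillespie-style scheme can drive such a simulation, the Python sketch below uses two event types: link creation with a preferential-attachment choice of recipient, and link atrophy. Everything specific is an assumption made for the example (the rates, the degree-plus-one attachment weight, the function name). The common-neighbor mechanism mentioned in the abstract is left out, so this is not the authors' model.

```python
import random


def simulate_email_network(n_nodes, t_end, add_rate=1.0, drop_rate=0.1, seed=0):
    """Gillespie direct method with two event types on an undirected graph.
    Requires n_nodes >= 2. Returns (set of edges, list of node degrees)."""
    rng = random.Random(seed)
    edges = set()
    degree = [0] * n_nodes
    t = 0.0
    while True:
        a_add = add_rate * n_nodes       # propensity: some agent contacts someone
        a_drop = drop_rate * len(edges)  # propensity: some existing link atrophies
        total = a_add + a_drop
        t += rng.expovariate(total)      # exponential waiting time to the next event
        if t > t_end:
            break
        if not edges or rng.random() * total < a_add:
            src = rng.randrange(n_nodes)
            # Preferential attachment (assumed kernel): weight = degree + 1.
            weights = [0 if v == src else degree[v] + 1 for v in range(n_nodes)]
            dst = rng.choices(range(n_nodes), weights=weights)[0]
            edge = (min(src, dst), max(src, dst))
            if edge not in edges:
                edges.add(edge)
                degree[src] += 1
                degree[dst] += 1
        else:
            edge = rng.choice(sorted(edges))  # sorted for reproducibility
            edges.remove(edge)
            degree[edge[0]] -= 1
            degree[edge[1]] -= 1
    return edges, degree


edges, degree = simulate_email_network(n_nodes=20, t_end=5.0)
```

The efficiency gain comes from jumping straight to the next event time instead of stepping through fixed time slices in which most agents do nothing.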

    FRCurve computed on the three GAGE datasets and on Assemblathon 1 entries.

    Figures A, B, and C show the FRCurves computed on the three GAGE datasets (Staphylococcus aureus, Rhodobacter sphaeroides, and Human chromosome 14). Figure D shows the FRCurves computed on Assemblathon 1 entries.

    Feature Response Curve and ICA features: Real Short Reads.

    Figure A shows the FRC for the 5 assemblers on the E. coli real dataset (read length bp, insert size bp and coverage ) when using the entire feature space. Figure B shows the FRC computed on the ICA-selected features.

    Dotplot validation of the longest scaffolds produced by Allpaths-LG and MSR-CA on Staphylococcus a. dataset.

    Figures A and B show the dotplot of the longest scaffolds produced by Allpaths-LG and MSR-CA against the reference genome.